{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Visualizing scikit-learn pipelines in Jupyter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## First we load the dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to define our data and target. In this case we build a classification\n",
    "model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "ames_housing = pd.read_csv(\"../datasets/house_prices.csv\", na_values=\"?\")\n",
    "\n",
    "target_name = \"SalePrice\"\n",
    "data, target = (\n",
    "    ames_housing.drop(columns=target_name),\n",
    "    ames_housing[target_name],\n",
    ")\n",
    "target = (target > 200_000).astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We inspect the first rows of the dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the sake of simplicity, we can cherry-pick some features and only retain\n",
    "this arbitrary subset of data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "numeric_features = [\"LotArea\", \"FullBath\", \"HalfBath\"]\n",
    "categorical_features = [\"Neighborhood\", \"HouseStyle\"]\n",
    "data = data[numeric_features + categorical_features]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Then we create the pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The first step is to define the preprocessing steps"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.pipeline import Pipeline\n",
    "from sklearn.impute import SimpleImputer\n",
    "from sklearn.preprocessing import StandardScaler, OneHotEncoder\n",
    "\n",
    "numeric_transformer = Pipeline(\n",
    "    steps=[\n",
    "        (\"imputer\", SimpleImputer(strategy=\"median\")),\n",
    "        (\n",
    "            \"scaler\",\n",
    "            StandardScaler(),\n",
    "        ),\n",
    "    ]\n",
    ")\n",
    "\n",
    "categorical_transformer = OneHotEncoder(handle_unknown=\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next step is to apply the transformations using `ColumnTransformer`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import ColumnTransformer\n",
    "\n",
    "preprocessor = ColumnTransformer(\n",
    "    transformers=[\n",
    "        (\"num\", numeric_transformer, numeric_features),\n",
    "        (\"cat\", categorical_transformer, categorical_features),\n",
    "    ]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Then we define the model and join the steps in order"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "model = Pipeline(\n",
    "    steps=[\n",
    "        (\"preprocessor\", preprocessor),\n",
    "        (\"classifier\", LogisticRegression()),\n",
    "    ]\n",
    ")\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's fit it!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model.fit(data, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice that the diagram changes color once the estimator is fit.\n",
    "\n",
    "So far we used `Pipeline` and `ColumnTransformer`, which allows us to custom\n",
    "the names of the steps in the pipeline. An alternative is to use\n",
    "`make_column_transformer` and `make_pipeline`, they do not require, and do not\n",
    "permit, naming the estimators. Instead, their names are set to the lowercase\n",
    "of their types automatically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.compose import make_column_transformer\n",
    "from sklearn.pipeline import make_pipeline\n",
    "\n",
    "numeric_transformer = make_pipeline(\n",
    "    SimpleImputer(strategy=\"median\"), StandardScaler()\n",
    ")\n",
    "categorical_transformer = OneHotEncoder(handle_unknown=\"ignore\")\n",
    "\n",
    "preprocessor = make_column_transformer(\n",
    "    (numeric_transformer, numeric_features),\n",
    "    (categorical_transformer, categorical_features),\n",
    ")\n",
    "model = make_pipeline(preprocessor, LogisticRegression())\n",
    "model.fit(data, target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Finally we can score the model using cross-validation:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import cross_validate\n",
    "\n",
    "cv_results = cross_validate(model, data, target, cv=5)\n",
    "scores = cv_results[\"test_score\"]\n",
    "print(\n",
    "    \"The mean cross-validation accuracy is: \"\n",
    "    f\"{scores.mean():.3f} \u00b1 {scores.std():.3f}\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p>In this case, around 86% of the times the pipeline correctly predicts whether\n",
    "the price of a house is above or below the 200_000 dollars threshold. But be\n",
    "aware that this score was obtained by picking some features by hand, which is\n",
    "not necessarily the best thing we can do for this classification task. In this\n",
    "example we can hope that fitting a complex machine learning pipelines on a\n",
    "richer set of features can improve upon this performance level.</p>\n",
    "<p class=\"last\">Reducing a price estimation problem to a binary classification problem with a\n",
    "single threshold at 200_000 dollars is probably too coarse to be useful in in\n",
    "practice. Treating this problem as a regression problem is probably a better\n",
    "idea. We will see later in this MOOC how to train and evaluate the performance\n",
    "of various regression models.</p>\n",
    "</div>"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}